Clitics in Arabic Language: A Statistical Study

نویسندگان

  • Fahad Alotaiby
  • Salah Foda
  • Ibrahim Alkharashi
چکیده

Clitics in Arabic language can be attached to a stem or to each other without orthographic marks such as an apostrophe. In this paper we present a statistical study of clitics and its effect in Arabic language. We tokenize large Arabic text using white-spaces and an automatic clitics tokenizer (AMIRA 2.0) and compare the unique-word count in both cases with English language. We also show the resulted distribution of clitics in Arabic and examine the performance of the used tokenizer. Using a 600 million words Arabic corpus, we report that the corresponding lexicon size could be reduced by 24.54% when applying clitics tokenization.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Constructing and Using Broad-coverage Lexical Resource for Enhancing Morphological Analysis of Arabic

Broad-coverage language resources which provide prior linguistic knowledge must improve the accuracy and the performance of NLP applications. We are constructing a broad-coverage lexical resource to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons were constructed; these lexicons...

متن کامل

Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation

This paper is about improving the quality of Arabic-English statistical machine translation (SMT) on dialectal Arabic text using morphological knowledge. We present a light-weight rule-based approach to producing Modern Standard Arabic (MSA) paraphrases of dialectal Arabic out-of-vocabulary (OOV) words and low frequency words. Our approach extends an existing MSA analyzer with a small number of...

متن کامل

Abstracts/Journal of the Arabic Language and LiteratureVol.14, No48, autumn 2018

Contents The Representation of Culture in Arabic pedagogy books to non-Arabic languages Danesh Mohammadi, Sakineh Zarenejad....................................................... 1 Critical Study of the‏‏manifestations of Mamluke's life from the novel “Alsaeroun‏‏niyam...

متن کامل

Formulation of Language Teachers̕ Identity in the Situated Learning of Language Teaching Community of Practice

A community of practice may shape and reshape the identity of members of the community through providing them with situated learning or learning environment. This study, therefore, is to clarify the salient learning-based features of the language teaching community of practice that might formulate the identity of language teachers. To this end, the study examined how learning situations in two ...

متن کامل

Automatic Headline Generation using Character Cross-Correlation

Arabic language is a morphologically complex language. Affixes and clitics are regularly attached to stems which make direct comparison between words not practical. In this paper we propose a new automatic headline generation technique that utilizes character cross-correlation to extract best headlines and to overcome the Arabic language complex morphology. The system that uses character cross-...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010